Our locking mechanism was designed and implemented with cfengine version 1.4.x in mind. Cfengine is a descriptive language and a configuration robot which can perform distributed system administration on large networks[6,7,8,9]. A cfengine program is generally scheduled as a cron job, but can also be initiated interactively or by remote network connection. Cfengine can examine many hundreds of files and system processes, and launch dozens of user scripts, depending on the time of day and the host concerned. Cfengine's job is to coordinate these activities based on the state of the system. The state comprises many variables based on host type, date, time, and the present condition of the host as compared to a reference model. Its total run time involves too many variables to be practically predictable.
Opening cfengine to the network places an extra onus on its behaviour with respect to scheduling. Although designed in such a way that it does not give away any rights to outside users, cfengine is intentionally constructed so that general users (not just root) can be allowed to execute the standard configuration in order to update or diagnose the system, even when human administrators are not available. The mere thought of this is enough to send convulsions down the spines of many system folks, and it would indeed be a cause for concern unless measures were incorporated to protect such a provision from abuse. Adaptive locks will therefore play a central role in a `connected' cfengine environment in the future.
We have tested our adaptive locks with cron-initiated cfengine as well as with remote connections, with some success. The locks do indeed fulfill their role in preventing seizures and overlaps which can occur due to unforeseen delays. The locking of individual atoms means that, even though a particular script might run over its allotted time, other scripts and tasks can be completed without delay, come the next scheduling time. Silly mistakes can also be dealt with unproblematically: a cfengine program which starts itself is impervious to the apparent recursive well, provided the IfElapsed parameter is not set to zero. This is, after all, simply an example of spamming (see below).
Adaptive locks are very important for cfengine: cfengine is a tool which is supposed to automate basic system administration tasks and work as a front end for user-scripts, allowing administrators to collect an entire network's scripts into a single place and providing a net-wide front-end for cron. In order to be effective in this role, cfengine must support a high degree of autonomy. Cfengine atomizes operations in different ways. Some operations, such as file editing and script execution, are locked on a per-file basis. Other operations which could involve large scale traversals of the file system are locked per class of operation. The aim of the locking policy is to make the system safe and efficient – i.e. not to overload the system with contrary tasks.
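The two granularities of atomization can be illustrated with a minimal Python sketch. The operation names, the `lock_name` function and the naming scheme are hypothetical and chosen only for illustration; they are not taken from the cfengine source.

```python
# Hypothetical set of operation types that are locked per object
# (one lock per edited file or executed script).
PER_OBJECT_OPERATIONS = {"editfiles", "shellcommands"}


def lock_name(operation: str, target: str) -> str:
    """Return an illustrative lock name for one atom.

    Per-object operations get a lock that includes the target, so two
    different files can be handled independently. All other operations
    share one lock per class of operation.
    """
    if operation in PER_OBJECT_OPERATIONS:
        # Flatten the path so that the name is usable as a file name.
        flattened = target.strip("/").replace("/", "_")
        return f"lock.{operation}.{flattened}"
    return f"lock.{operation}"


print(lock_name("editfiles", "/etc/motd"))
print(lock_name("editfiles", "/etc/hosts"))
print(lock_name("tidy", "/tmp"))
print(lock_name("tidy", "/var/tmp"))
```

In this sketch, editing two different files yields two distinct locks, whereas two file system traversals of the same class compete for a single lock.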
Previously, cfengine processes were locked by a single global lock. If a process were interrupted for some reason, a hanging lock would remain and cause warning messages to be printed from the affected hosts the next time cfengine was scheduled. Certain cfengine processes would overrun their allotted time: typically the weekly runs which perform extensive system checking and updates of system databases. This would happen once a week, generating useless mail which everyone would have been happier not to receive. Bad NFS connections through buggy kernels have been known to hang scripts. Also, bugs in cfengine itself, which manifest themselves only under special conditions, could result in a core dump and a hanging lock. Although each isolated occurrence of these problems was relatively rare, the cumulative effect on a large network could be substantial enough to be an irritation. The system administrator would then be required to chase after these old locks and remove them.
The new locks allow several cfengines to coexist as different processes, without interference. Moreover, since one of the purposes of automation is to minimize the number of fruitless messages from the system, the original locking policy was clearly not in tune with cfengine's autonomous philosophy. Using the new adaptive locks, cfengine can clean up its own hanging processes without the intervention of a human, and even better: silently. In large network environments such silence is golden.
With large scale system checking, the total number of locks used in a
single pass of cfengine might approach several tens or even a hundred
on a busy system, but only one active lock is present per active
thread. (We do not normally expect more than two threads for normal
system administration tasks.) The anti-spamming locks take up only a
single inode each and since most file systems have thousands of spare
inodes, this usage is hardly a concern. The first part of the table
shows runtimes for a small cfengine run which sets 24
locks, while the last part shows a run which sets 32 locks. Some of
the operations involved in the second run are large. Although the
difference in real time seems large for the smaller run, the
difference in user and system time is much smaller. The actual CPU
time spent to set and remove the locks is not high, which means that
we wait for the disk when creating and deleting the locks. For the
larger run, the differences are almost the same, but here the
dominating part of the run is the cfengine operations themselves, not the
administration of locks.
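The distinction between real time and CPU time drawn above can be explored with a small measurement sketch in Python. It assumes that each lock is a single empty file which is created and then removed; the function name `time_lock_files` and the file naming are hypothetical, and the figures obtained will depend entirely on the host and file system.

```python
import os
import tempfile
import time


def time_lock_files(count: int) -> tuple[float, float]:
    """Create and remove `count` empty lock files.

    Returns (elapsed wall-clock seconds, CPU seconds of this process).
    """
    with tempfile.TemporaryDirectory() as lock_dir:
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        for i in range(count):
            path = os.path.join(lock_dir, f"lock.{i}")
            # O_EXCL makes creation fail if the lock already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            os.unlink(path)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = time_lock_files(32)
print(f"real: {wall:.4f}s  cpu: {cpu:.4f}s")
```

A large gap between the two numbers would suggest that the process is waiting for the disk rather than computing.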
Cfengine can be exposed to infinite loops from which it will recover
gracefully. The figure illustrates a cfengine program which
calls itself. Suppose we have a cfengine program which contains three
atomic operations A, B and C. Suppose also that B is a
shell command which executes cfengine. Let us then examine how the
locks handle the execution of this program, assuming i) that the
scripts have not been executed for a long time (longer than IfElapsed) and
ii) that the locking parameters have `sensible' values.
The example in the figure shows that no more than two cfengine
processes will be started. When cfengine is first started, it executes
atom A, locking and unlocking it normally. When it arrives at B,
a lock is acquired to run cfengine recursively (since this has not
previously occurred) and a second cfengine process proceeds to
run. Under the auspices of this second process, a new lock is
requested for A, but this fails because too little time has passed since the last
instance of A from process #1. Next a lock is requested for B,
but this also fails because B is busy and not enough time has passed
for it to expire. Thus we come to C. Since C has not been
executed, cfengine obtains a lock for C and executes it to
completion, then releasing the lock. Process #2 is then complete and so
is atom B from cfengine process #1. The lock for B is released and
cfengine attempts to finish process #1 by getting a lock for C. This
fails, however, since C was just executed by process #2 and not
enough time has elapsed for it to be restarted (or killed). The first
process is then complete.
Notice how two processes flow through one another. The real work in A and C (which could have been done by a single process) simply gets shared between two processes, and no harm is done.
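The sequence of events described above can be reproduced with a small simulation. The following Python sketch is not cfengine's implementation: the `Locks` class, the logical clock and the parameter values are assumptions made for illustration, and the recursion in atom B is modelled as a nested function call rather than a real child process.

```python
IF_ELAPSED = 1      # minutes that must pass before an atom may run again (assumed value)
EXPIRE_AFTER = 120  # minutes after which a busy lock counts as hung (assumed value)


class Locks:
    """Toy lock table driven by a logical clock measured in minutes."""

    def __init__(self):
        self.last_start = {}  # atom -> time of the last successful acquisition
        self.busy = set()     # atoms whose lock is currently held
        self.now = 1000.0     # nothing has been run "for a long time"

    def acquire(self, atom):
        started = self.last_start.get(atom)
        if atom in self.busy:
            if self.now - started < EXPIRE_AFTER:
                return False         # busy and not yet expired
            self.busy.discard(atom)  # expired: the old holder would be killed here
        elif started is not None and self.now - started < IF_ELAPSED:
            return False             # ran too recently (anti-spamming)
        self.last_start[atom] = self.now
        self.busy.add(atom)
        return True

    def release(self, atom):
        self.busy.discard(atom)


def run_cfengine(locks, pid, trace):
    """Run atoms A, B and C in order; atom B starts another cfengine."""
    for atom in ("A", "B", "C"):
        if not locks.acquire(atom):
            trace.append((pid, atom, "skipped"))
            continue
        trace.append((pid, atom, "ran"))
        locks.now += 0.1  # each atom takes a little time
        if atom == "B":
            run_cfengine(locks, pid + 1, trace)
        locks.release(atom)


trace = []
run_cfengine(Locks(), 1, trace)
for entry in trace:
    print(entry)
```

Under these assumptions process #1 runs A and B, process #2 skips A and B and runs C, and process #1 then skips C. No third process is started, because the nested run never obtains the lock for B.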
A similar sequence of events occurs if a process hangs while executing
an atom (see the figure). Suppose that an old instantiation
of cfengine (process #1) managed to execute A successfully, but hung while
executing atom B. Later, after the lock on B has expired, another
cfengine (process #2) will execute A again, kill the previous lock on
B and execute B, then execute C. Here we assume that B hangs
for some spurious reason, not because of any fundamental problem with
B.
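The hanging case can be sketched in the same spirit. In the Python example below, the `Lock` record, the `request` and `release` functions, the parameter values and the time stamps are all hypothetical; killing the hung process is represented simply by recording its process number.

```python
from dataclasses import dataclass
from typing import Optional

IF_ELAPSED = 60     # minutes (illustrative value)
EXPIRE_AFTER = 120  # minutes (illustrative value)


@dataclass
class Lock:
    started: float         # time the atom was last started, in minutes
    holder: Optional[int]  # process still holding the lock, None once released


def request(locks, atom, pid, now, killed):
    """Grant the lock for `atom` to `pid` if the policy allows it."""
    lock = locks.get(atom)
    if lock is not None:
        age = now - lock.started
        if lock.holder is not None:
            if age < EXPIRE_AFTER:
                return False            # busy and not yet expired
            killed.append(lock.holder)  # expired: remove the hung holder
        elif age < IF_ELAPSED:
            return False                # completed too recently
    locks[atom] = Lock(started=now, holder=pid)
    return True


def release(locks, atom):
    locks[atom].holder = None


locks, killed = {}, []

# Process #1 at t=0: A completes, B is started and then hangs (never released).
request(locks, "A", 1, 0.0, killed)
release(locks, "A")
request(locks, "B", 1, 1.0, killed)

# Process #2 at t=200, after the lock on B has expired.
ran = []
for atom in ("A", "B", "C"):
    if request(locks, atom, 2, 200.0, killed):
        ran.append(atom)
        release(locks, atom)

print(ran, killed)
```

With these values the second process executes A, B and C, and process #1 is recorded as killed when the expired lock on B is taken over.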
Similar scenarios can be constructed with remote connections and more convoluted loops. All of these either reduce to the examples above or are defeated by cfengine's refusal to copy from a host to itself via the network (local copying without socket waits is used instead). Spamming attacks from malicious users are stifled by the same anti-spamming locks.
For various reasons our implementation of locks in cfengine includes logging of lock behaviour. This allows us to trace the execution of scripts and other atoms in a cfengine program and gain an impression of how long the individual elements took to complete. This information could then be fed back into the locking mechanism to optimize the parameters IfElapsed and ExpireAfter.
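One possible form of such feedback is sketched below in Python. The record format (atom name, start and end time in minutes), the function `suggest_expire_after` and the rule of multiplying the longest observed duration by a safety factor are all assumptions made for illustration; they do not describe cfengine's actual log format or any tuning rule used by it.

```python
from collections import defaultdict


def suggest_expire_after(records, safety_factor=3.0, floor=1.0):
    """Suggest an ExpireAfter value (in minutes) for each atom.

    `records` is an iterable of (atom, start_minute, end_minute) tuples.
    The suggestion is the longest observed duration times a safety
    factor, but never less than `floor`.
    """
    longest = defaultdict(float)
    for atom, start, end in records:
        longest[atom] = max(longest[atom], end - start)
    return {atom: max(floor, safety_factor * duration)
            for atom, duration in longest.items()}


log = [("A", 0, 2), ("A", 60, 65), ("B", 2, 3)]
print(suggest_expire_after(log))
```

An atom that has occasionally taken five minutes would then be allowed fifteen before its lock is treated as hung. A value for IfElapsed could be derived from the same records in a similar way.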